Quality Assessment of Linked Datasets using Probabilistic Approximation
With the increasing application of Linked Open Data, assessing the quality of
datasets by computing quality metrics becomes an issue of crucial importance.
For large and evolving datasets, an exact, deterministic computation of the
quality metrics is too time consuming or expensive. We employ probabilistic
techniques such as Reservoir Sampling, Bloom Filters and Clustering Coefficient
estimation for implementing a broad set of data quality metrics in an
approximate but sufficiently accurate way. Our implementation is integrated in
the comprehensive data quality assessment framework Luzzu. We evaluated its
performance and accuracy on Linked Open Datasets of broad relevance.
Comment: 15 pages, 2 figures, to appear in ESWC 2015 proceedings
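The probabilistic machinery named in the abstract can be illustrated with a short sketch. Below, a minimal reservoir-sampling (Algorithm R) estimator for a proportion-style quality metric over a stream of RDF triples; the metric, predicate, and parameter names are illustrative assumptions, not Luzzu's actual API:

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Algorithm R: keep a uniform random sample of k items from a
    stream of unknown length using O(k) memory."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = rng.randint(0, i)  # inclusive bounds
            if j < k:
                reservoir[j] = item
    return reservoir

# Hypothetical metric: fraction of triples whose object is an HTTP IRI.
def estimate_metric(triples, k=1000,
                    predicate=lambda t: t[2].startswith("http")):
    sample = reservoir_sample(triples, k)
    return sum(predicate(t) for t in sample) / len(sample)
```

Because the reservoir is a uniform sample, the estimated proportion approaches the exact metric value as the sample grows, while memory stays constant regardless of dataset size.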
Sampled Weighted Min-Hashing for Large-Scale Topic Mining
We present Sampled Weighted Min-Hashing (SWMH), a randomized approach to
automatically mine topics from large-scale corpora. SWMH generates multiple
random partitions of the corpus vocabulary based on term co-occurrence and
agglomerates highly overlapping inter-partition cells to produce the mined
topics. While other approaches define a topic as a probabilistic distribution
over a vocabulary, SWMH topics are ordered subsets of such vocabulary.
Interestingly, the topics mined by SWMH underlie themes from the corpus at
different levels of granularity. We extensively evaluate the meaningfulness of
the mined topics both qualitatively and quantitatively on the NIPS (1.7 K
documents), 20 Newsgroups (20 K), Reuters (800 K) and Wikipedia (4 M) corpora.
Additionally, we compare the quality of SWMH with Online LDA topics for
document representation in classification.
Comment: 10 pages, Proceedings of the Mexican Conference on Pattern Recognition 2015
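The core primitive behind SWMH can be sketched with plain (unweighted) MinHash, under which words whose co-occurrence sets yield the same min-hash values fall into the same partition cell; this is a simplified stand-in for the weighted, sampled scheme in the paper, and the salted-hash construction is an assumption for illustration:

```python
import random

def minhash_signature(items, num_hashes=64, seed=0):
    """MinHash: for each of num_hashes salted hash functions, record the
    minimum hash value over the set. The probability that two signatures
    agree at a coordinate equals the Jaccard similarity of the sets."""
    rng = random.Random(seed)
    # Salted built-in hash as a stand-in for a family of hash functions.
    salts = [rng.getrandbits(64) for _ in range(num_hashes)]
    return [min(hash((salt, x)) for x in items) for salt in salts]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of agreeing coordinates estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Grouping vocabulary terms by signature collisions yields the kind of random co-occurrence partitions that SWMH agglomerates into topics.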
Interval Selection in the Streaming Model
A set of intervals is independent when the intervals are pairwise disjoint.
In the interval selection problem we are given a set I of intervals
and we want to find an independent subset of intervals of largest cardinality.
Let α(I) denote the cardinality of an optimal solution. We
discuss the estimation of α(I) in the streaming model, where we
only have one-time, sequential access to the input intervals, the endpoints of
the intervals lie in {1,...,n}, and the amount of memory is
constrained.
For intervals of different sizes, we provide an algorithm in the data stream
model that computes an estimate α̂ of α(I) that, with
probability at least 2/3, satisfies (1/2)(1-ε)α(I) ≤ α̂ ≤ α(I). For same-length
intervals, we provide another algorithm in the data stream model that computes
an estimate α̂ of α(I) that, with probability at
least 2/3, satisfies (2/3)(1-ε)α(I) ≤ α̂ ≤ α(I). The space used by our algorithms
is bounded by a polynomial in 1/ε and log n. We also show that no better
estimations can be achieved using o(n) bits of storage.
We also develop new, approximate solutions to the interval selection problem,
where we want to report a feasible solution, that use O(α(I))
space. Our algorithms for the interval selection problem match the optimal
results by Emek, Halldórsson and Rosén [Space-Constrained Interval
Selection, ICALP 2012], but are much simpler.
Comment: Minor corrections
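For reference, the offline version of interval selection is solved exactly by the classic greedy scan over intervals sorted by right endpoint; the streaming algorithms above approximate the size of this optimum under memory constraints. A minimal sketch of the offline baseline:

```python
def max_independent_intervals(intervals):
    """Exact offline solution: scan intervals in order of right endpoint
    and greedily keep every interval that starts after the last selected
    one ends. Intervals are closed [l, r]; independent means pairwise
    disjoint (no shared point)."""
    selected = []
    last_end = float("-inf")
    for l, r in sorted(intervals, key=lambda iv: iv[1]):
        if l > last_end:  # disjoint from everything selected so far
            selected.append((l, r))
            last_end = r
    return selected
```

The streaming setting forbids this sort, since the intervals arrive once in arbitrary order and cannot all be stored, which is what makes constant-factor estimation in small space the interesting regime.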
Fast Locality-Sensitive Hashing Frameworks for Approximate Near Neighbor Search
The Indyk-Motwani Locality-Sensitive Hashing (LSH) framework (STOC 1998) is a
general technique for constructing a data structure to answer approximate near
neighbor queries by using a distribution H over locality-sensitive
hash functions that partition space. For a collection of n points, after
preprocessing, the query time is dominated by O(n^ρ log n) evaluations
of hash functions from H and O(n^ρ) hash table lookups and
distance computations, where ρ ∈ (0,1) is determined by the
locality-sensitivity properties of H. It follows from a recent
result by Dahlgaard et al. (FOCS 2017) that the number of locality-sensitive
hash functions can be reduced to O(log² n), leaving the query time to be
dominated by O(n^ρ) distance computations and O(n^ρ log n)
additional word-RAM operations. We state this result as a general framework and
provide a simpler analysis showing that the number of lookups and distance
computations closely match the Indyk-Motwani framework, making it a viable
replacement in practice. Using ideas from another locality-sensitive hashing
framework by Andoni and Indyk (SODA 2006) we are able to reduce the number of
additional word-RAM operations to O(n^ρ).
Comment: 15 pages, 3 figures
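As a concrete instantiation of such a framework, below is a minimal random-hyperplane (SimHash) LSH index for angular distance with L tables of k-bit keys; the class name and parameters are illustrative, and this sketches the classic Indyk-Motwani layout rather than the reduced-hash-function constructions discussed above:

```python
import random

class HyperplaneLSH:
    """Random-hyperplane LSH for angular distance: each of L tables
    hashes a vector to the sign pattern of k random Gaussian
    projections, following the Indyk-Motwani table layout."""
    def __init__(self, dim, k=8, L=10, seed=0):
        rng = random.Random(seed)
        self.planes = [[[rng.gauss(0, 1) for _ in range(dim)]
                        for _ in range(k)] for _ in range(L)]
        self.tables = [{} for _ in range(L)]

    def _key(self, t, v):
        # k-bit key: sign of each projection onto table t's hyperplanes.
        return tuple(sum(p * x for p, x in zip(plane, v)) >= 0
                     for plane in self.planes[t])

    def insert(self, v):
        for t in range(len(self.tables)):
            self.tables[t].setdefault(self._key(t, v), []).append(v)

    def query(self, v):
        # Union of colliding buckets; the caller filters candidates by
        # true distance (the distance computations in the analysis).
        candidates = []
        for t in range(len(self.tables)):
            candidates.extend(self.tables[t].get(self._key(t, v), []))
        return candidates
```

Each query costs L·k projections (the hash-function evaluations), L lookups, and one distance computation per candidate, mirroring the cost breakdown in the abstract.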
The Bloom Clock for Causality Testing
Testing for causality between events in distributed executions is a
fundamental problem. Vector clocks solve this problem but do not scale well.
The probabilistic Bloom clock can determine causality between events with lower
space, time, and message-space overhead than vector clocks; however, its predictions
suffer from false positives. We give the protocol for the Bloom clock based on
Counting Bloom filters and study its properties including the probabilities of
a positive outcome and a false positive. We show the results of extensive
experiments to determine how these probabilities vary as a function of
the Bloom timestamps of the two events being tested, and to determine the
accuracy, precision, and false positive rate of a slice of the execution
containing events in the temporal proximity of each other. Based on these
experiments, we make recommendations for the setting of the Bloom clock
parameters. We postulate the causality spread hypothesis from the application's
perspective to indicate whether Bloom clocks will be suitable for correct
predictions with high confidence. The Bloom clock design can serve as a viable
space-, time-, and message-space-efficient alternative to vector clocks if
false positives can be tolerated by an application.
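A minimal Bloom clock sketch based on a counting Bloom filter, assuming SHA-256-derived cell indices and illustrative sizes (the paper's protocol details may differ):

```python
import hashlib

class BloomClock:
    """Bloom clock: a counting Bloom filter over event IDs. On a local
    event, increment the k cells the event hashes to; on message
    receipt, take the cell-wise max with the sender's clock. Event e1
    is predicted to causally precede e2 when e1's timestamp is
    cell-wise <= e2's; this test can yield false positives."""
    def __init__(self, m=32, k=3):
        self.m, self.k, self.cells = m, k, [0] * m

    def _indices(self, event_id):
        digest = hashlib.sha256(str(event_id).encode()).digest()
        return [digest[i] % self.m for i in range(self.k)]

    def tick(self, event_id):
        for i in self._indices(event_id):
            self.cells[i] += 1

    def merge(self, other):
        self.cells = [max(a, b) for a, b in zip(self.cells, other.cells)]

    def le(self, other):
        """Positive outcome: this timestamp may precede other's."""
        return all(a <= b for a, b in zip(self.cells, other.cells))
```

On a send, the sender ships its clock with the message; the receiver merges and then ticks, so cell-wise dominance tracks potential causality, with false positives arising when unrelated events happen to hash to overlapping cells.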
Functional limit theorems for random regular graphs
Consider d uniformly random permutation matrices on n labels. Consider the
sum of these matrices along with their transposes. The total can be interpreted
as the adjacency matrix of a random regular graph of degree 2d on n vertices.
We consider limit theorems for various combinatorial and analytical properties
of this graph (or the matrix) as n grows to infinity, either when d is kept
fixed or grows slowly with n. In a suitable weak convergence framework, we
prove that the (finite but growing in length) sequences of the number of short
cycles and of cyclically non-backtracking walks converge to distributional
limits. We estimate the total variation distance from the limit using Stein's
method. As an application of these results we derive limits of linear
functionals of the eigenvalues of the adjacency matrix. A key step in this
latter derivation is an extension of the Kahn-Szemerédi argument for
estimating the second largest eigenvalue for all values of d and n.
Comment: Added Remark 27. 39 pages. To appear in Probability Theory and Related Fields
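The construction in the abstract, summing d uniformly random permutation matrices with their transposes, can be sketched directly (pure Python, illustrative):

```python
import random

def permutation_model_adjacency(n, d, seed=0):
    """Adjacency matrix of the permutation model: A = sum_i (P_i + P_i^T)
    for d uniformly random n x n permutation matrices P_i. Every row and
    column of A sums to 2d, i.e. the resulting multigraph is 2d-regular
    (loops and multiple edges are possible)."""
    rng = random.Random(seed)
    A = [[0] * n for _ in range(n)]
    for _ in range(d):
        perm = list(range(n))
        rng.shuffle(perm)        # uniformly random permutation
        for u, v in enumerate(perm):
            A[u][v] += 1         # entry of P_i
            A[v][u] += 1         # entry of P_i^T
    return A
```

The short-cycle counts studied in the paper are combinatorial statistics of exactly this matrix as n grows.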